PHAST: Spoken Document Retrieval Based on Sequence Alignment

Authors

  • Pere R. Comas
  • Jordi Turmo
Abstract

This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is to combine an automatic speech recognizer (ASR) with standard information retrieval techniques based on terms or n-grams. However, state-of-the-art large-vocabulary continuous ASRs produce transcripts of spontaneous speech with a word error rate of 25% or higher, which is a drawback for retrieval techniques based on terms or n-grams. To overcome this limitation, our method uses a sequence alignment algorithm drawn from the field of bioinformatics to search for “sounds like” sequences in the document collection. These matching sequences are potentially words misrecognized by the ASR, and they can be used to retrieve relevant passages and documents from the collection. Our approach does not depend on extra information provided by the ASR. We have evaluated our approach and compared it with state-of-the-art alternatives in both spoken document retrieval and spoken passage retrieval tasks. The evaluation has been performed in the context of Question Answering, using a corpus of automatic transcripts from the Spanish and European parliaments. The results show that our method outperforms traditional term-based search and n-gram search on automatic transcripts by 10 points.
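As a rough illustration of the “sounds like” idea, the sketch below scores how well a query string aligns locally with a transcript string. It is not the PHAST algorithm itself, whose details are not given in this abstract: it uses the textbook Smith-Waterman local alignment as a stand-in, and the function name `local_alignment_score` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example. In practice the inputs would more plausibly be phone sequences than raw characters.

```python
def local_alignment_score(query, text, match=2, mismatch=-1, gap=-2):
    """Return (best_score, end_position_in_text) of the best local alignment.

    Textbook Smith-Waterman with a linear gap penalty; a stand-in for the
    alignment step, not the paper's own algorithm. The scoring values are
    illustrative assumptions.
    """
    prev = [0] * (len(text) + 1)      # DP row for the previous query symbol
    best, best_end = 0, 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(text) + 1)  # DP row for the current query symbol
        for j in range(1, len(text) + 1):
            sub = match if query[i - 1] == text[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment here
                prev[j - 1] + sub,    # align query[i-1] with text[j-1]
                prev[j] + gap,        # query symbol aligned to a gap
                curr[j - 1] + gap,    # text symbol aligned to a gap
            )
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end


if __name__ == "__main__":
    # A misrecognized word still scores well against the intended query.
    print(local_alignment_score("parliament", "the par lament met today"))
```

A retrieval system built on this idea would keep passages whose score exceeds some threshold relative to the query length; how that threshold is chosen is left open here.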


Similar resources

Spoken Document Retrieval Based on Approximated Sequence Alignment

This paper presents a new approach to spoken document information retrieval for spontaneous speech corpora. The classical approach to this problem is the use of an automatic speech recognizer (ASR) combined with standard information retrieval techniques. However, ASRs tend to produce transcripts of spontaneous speech with significant word error rate, which is a drawback for standard retrieval t...


PhAST: Pharmacophore alignment search tool

We present a ligand-based virtual screening technique (PhAST) for rapid hit and lead structure searching in large compound databases. Molecules are represented as strings encoding the distribution of pharmacophoric features on the molecular graph. In contrast to other text-based methods using SMILES strings, we introduce a new form of text representation that describes the pharmacophore of mole...


Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval, in which search over document images is based on a query word image. This paper proposes an approach to document image retrieval based on keyword spotting, together with a framework that uses relevance feedback. Relevance feedback, an interactive and efficient method, is used in this paper to imp...


Text-based similarity searching for hit- and lead-candidate identification

The Pharmacophore Alignment Search Tool (PhAST) is a string-based approach to virtual screening. Molecules are represented by linear sequences which describe their respective pattern of interaction possibilities. The problem of molecule linearization is tackled by applying Minimum Volume Embedding in combination with a Diffusion Kernel to the molecular graph [1,2]. Linear representations are co...


Package 'rphast' Title R Interface to Phast Software for Comparative Genomics

December 13, 2013 Copyright The code in src/pcre is Copyright (c) 1997-2010 University of Cambridge. All other code is Copyright (c) 2002-2010 University of California, Cornell University. Maintainer Melissa Hubisz License BSD_3_clause + file LICENSE Title R interface to PHAST software for comparative genomics Author Melissa Hubisz, Katherine Pollard, and Adam Siepel Desc...



Publication date: 2008